The DARPA Machine Reading Program - Encouraging Linguistic and Reasoning Research with a Series of Reading Tasks
Abstract
The goal of DARPA's Machine Reading (MR) program is nothing less than making the world's natural language corpora available for formal processing. Most text processing research has focused on locating mission-relevant text (information retrieval) and on techniques for enriching text by transforming it into other forms of text (translation, summarization), always for use by humans. In contrast, MR will make the knowledge contained in text available in forms that machines can use for automated processing, with little human intervention. Machines will learn to read from a few examples, and they will read to learn what they need in order to answer questions or perform some reasoning task. Three independent Reading Teams are building universal text engines that will capture knowledge from naturally occurring text and transform it into the formal representations used by Artificial Intelligence. An Evaluation Team is selecting and annotating text corpora with task domain concepts, creating model reasoning systems with which the reading systems will interact, and establishing question-answer sets and evaluation protocols to measure progress toward this goal. We describe the development of the MR evaluation framework, including test protocols, linguistic resources and technical infrastructure.

1. Background and Conceptual Framework

To advance research toward the end goal of general, lightly trained systems that read text "in the wild" well enough both to support automated reasoning tasks and to learn to read better, three independent Reading Teams are building universal text engines that will capture knowledge from naturally occurring text and transform it into the formal representations used by Artificial Intelligence. An Evaluation Team is selecting and annotating text corpora with task domain concepts, creating model reasoning systems with which the reading systems will interact, and establishing question-answer sets and evaluation protocols to measure progress toward this goal. While the Reading Teams will implement a full range of natural language processing (NLP) techniques, the research places significant emphasis on machine learning and knowledge representation. Aligning a reading system's internal linguistic representation of a text corpus with the semantics of an external reasoning system, and in fact learning that alignment "on the fly" during reading, is a major challenge for the program. This paper describes the development of the MR evaluation framework, including test protocols, linguistic resources and technical infrastructure.

The MR program is structured around a roadmap of linguistic and semantic capabilities, e.g. dealing with anaphora, causal and modal language, temporal and spatial reasoning, and sentiment and belief. Over the course of the five-year program, the Evaluation Team will provide a series of graded Reading Tasks (described below), which present the reading systems with increasingly difficult challenges. At each phase, increasingly complex linguistic and reasoning tasks are combined with increased performance expectations in query answering, expanded corpus volume, and reduced time to prepare and adapt the systems to new tasks.

2. Readability Task

The program begins with an interesting but somewhat orthogonal challenge: assessing the "readability" or quality of a text passage. This is motivated by the belief that a computer system able to extract a set of language features diverse enough, and of high enough quality, to assess readability at least as well as humans do will be primed to take on the challenges of machine reading. Measures of readability have been proposed and are in common use by educators and editors to estimate grade level or comprehension difficulty. These measures are usually based on surface features, such as sentence length or syllable count (Flesch, 1948; Kincaid et al., 1975), although some incorporate deeper concepts, such as word or sentence "complexity" (Gunning, 1952).
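To make the surface-feature approach concrete, here is a small illustrative sketch (ours, not part of the original paper) that computes the classic Flesch (1948) Reading Ease score from average sentence length and average syllables per word; the vowel-group syllable counter is a deliberate simplification.

    import re

    def count_syllables(word: str) -> int:
        """Crude vowel-group estimate; production tools use a pronouncing dictionary."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text: str) -> float:
        """Flesch (1948) Reading Ease: higher scores mean easier text."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        if not sentences or not words:
            raise ValueError("text must contain at least one sentence and one word")
        avg_sentence_len = len(words) / len(sentences)
        avg_syllables = sum(count_syllables(w) for w in words) / len(words)
        return 206.835 - 1.015 * avg_sentence_len - 84.6 * avg_syllables

    print(round(flesch_reading_ease("The cat sat on the mat. It purred."), 1))

Note that the score depends only on sentence length and syllable counts, which is precisely the kind of surface measure that the research cited below finds weakly correlated with adult judgments.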
However, recent research (Pitler & Nenkova, 2008) has shown that surface features correlate poorly with judgments by adult readers, whereas several syntactic, semantic, and discourse features correlate well. Reflecting the goal of the MR program, we define readability as a subjective judgment of how easily a reader can extract the information the writer or speaker intended to convey. We draw texts from a diverse range of genres that a machine reading system is likely to encounter: newswire stories, weblogs, newsgroup/forum posts, Wikipedia entries, broadcast transcripts and closed-caption text, and even some machine translation output. To separate the rating of texts from the measurement of human performance, we employ two panels. A panel of judges with expertise in reading tasks such as machine translation post-editing produces a gold standard judgment, a rating from 1 to 5 for each passage, and their mean defines a reference rating for that passage. A separate, "novice" panel of typical readers of English then rates the passages, to estimate the variable performance of humans at this task. Machine performance is expected to meet or exceed the performance of individual novice readers with statistical significance.

2.1 Data and Human Judgments

Genre and passage selection and human test protocols are designed to minimize bias from factors such as fatigue, topic familiarity, genre-specific surface features, and inter-passage ranking interference. Individual reading passages are selected by first preparing a large pool of candidate documents for each genre. Candidates are manually vetted to exclude inappropriate passages (e.g. those not in English); the pool is then automatically downsampled to produce the targeted number of passages for each genre. Selected passages are processed to standardize formatting and to remove genre-specific surface features such as newswire headlines or Wikipedia-style markup. Passages are also assigned to a general topic category (current events, sports, etc.) prior to assessment.

The prepared passages are then assigned to expert and novice judges for assessment via a web-based user interface. To reduce assessor fatigue, passages are grouped into sets of 10, called rounds. For each round, assessors first read each passage and give a rating of 1-5 "stars" to indicate how readable the passage is. After rating all passages in a round, assessors then provide a rank ordering of those passages in terms of their overall readability. Following each round, expert judges are also asked to state the criteria they used to determine the passage ratings and rankings; this is an open list, not a set of pre-supplied criteria. Assessors continue rating, ranking and (for experts only) listing criteria round by round until all passages have been judged. All passages are assessed by multiple independent novice and expert judges. Presentation order within and across rounds is randomized, and each round is roughly balanced for genre and topic; a similar balance is maintained between the training and testing data sets.
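As a concrete illustration of this protocol (our own sketch; the program's actual web-based infrastructure is not reproduced here), the following groups passages into rounds of 10, spreads genres across rounds, and randomizes presentation order within each round. The round size of 10 and the Passage fields come from the description above; all names and the balancing heuristic are assumptions.

    import random
    from dataclasses import dataclass

    @dataclass
    class Passage:
        pid: str
        genre: str   # e.g. "newswire", "weblog", "wikipedia"
        topic: str   # e.g. "current events", "sports"
        text: str

    def build_rounds(passages: list[Passage], round_size: int = 10,
                     seed: int = 0) -> list[list[Passage]]:
        """Deal passages into rounds, interleaving genres for rough balance and
        shuffling within each round to randomize presentation order."""
        rng = random.Random(seed)
        by_genre: dict[str, list[Passage]] = {}
        for p in passages:
            by_genre.setdefault(p.genre, []).append(p)
        pools = list(by_genre.values())
        for pool in pools:
            rng.shuffle(pool)
        interleaved: list[Passage] = []
        while any(pools):                      # round-robin across genres
            for pool in pools:
                if pool:
                    interleaved.append(pool.pop())
        rounds = [interleaved[i:i + round_size]
                  for i in range(0, len(interleaved), round_size)]
        for rnd in rounds:
            rng.shuffle(rnd)                   # randomize order within the round
        return rounds

A fuller implementation would also balance topics within rounds and hold the genre and topic balance constant across the training and test splits, as described above.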
2.2 Evaluation, Metrics, and Results

To reduce the chance that extraneous factors might dominate the test, such as variability in the expert ratings or bias from the experts' experience with previous linguistic annotation tasks, we devised several scoring metrics with different statistical characteristics for comparing machine judgments to (novice) human judgments.

Metric 1: Score Difference. This metric measures how much closer than the average novice the machine comes to the gold standard rating. We use the mean of the expert panel ratings as the reference score s(g,t) for text t. Let s(j,t) be the score of the j-th novice judge.
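The source excerpt breaks off before the formula is given, so the exact definition is not preserved here. A plausible formalization of the score-difference idea in our own notation, with s(m,t) the machine's rating of text t and J the set of novice judges, is:

    % Hypothetical reconstruction; not the paper's published formula.
    % d(x,t): absolute distance of rater x from the gold-standard reference on text t.
    \[ d(x,t) = \left| s(x,t) - s(g,t) \right| \]

    % Score difference: how much closer the machine m comes to the reference
    % than the average novice judge; positive values favor the machine.
    \[ \Delta(t) = \frac{1}{|J|} \sum_{j \in J} d(j,t) \, - \, d(m,t) \]

Under this reading, a positive Delta(t), averaged over the test passages, would indicate that the machine tracks the expert reference more closely than the typical novice.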